Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks
Abstract: In this paper, we study the challenging problem of categorizing videos according to high-level semantics such as the existence of a particular human action or a complex event. Despite extensive efforts in recent years, most existing works combine multiple video features using simple fusion strategies and neglect inter-class semantic relationships. This paper proposes a novel unified framework that jointly exploits the feature relationships and the class relationships for improved categorization performance. Specifically, these two types of relationships are estimated and utilized by rigorously imposing regularizations in the learning process of a deep neural network (DNN). Such a regularized DNN (rDNN) can be efficiently realized using a GPU-based implementation with an affordable training cost. By arming the DNN with a better capability of harnessing both the feature and the class relationships, the proposed rDNN is more suitable for modeling video semantics. With extensive experimental evaluations, we show that rDNN produces superior performance over several state-of-the-art approaches. On the well-known Hollywood2 and Columbia Consumer Video benchmarks, we obtain very competitive results: 66.9% and 73.5%, respectively, in terms of mean average precision. In addition, to evaluate our rDNN thoroughly and stimulate future research on large-scale video categorization, we collect and release a new benchmark dataset, called FCVID, which contains 91,223 Internet videos and 239 manually annotated categories.

Index Terms: Video Categorization, Deep Neural Networks, Regularization, Feature Fusion, Class Relationships, Benchmark Dataset.
1 Introduction

Videos carry very rich and complex semantics, and are intrinsically multimodal. Techniques for recognizing high-level semantics in diverse unconstrained videos can be deployed in many applications, such as Internet video search. However, it is well known that semantic recognition or categorization of videos is an extremely challenging task due to the semantic gap between computable low-level video features and the complex high-level semantics. While significant progress has been achieved in recent years, most state-of-the-art solutions rely on a large set of features with simple fusion strategies to model the high-level semantics. For instance, two popular ways of combining multiple video features are early fusion and late fusion. Early fusion concatenates all the feature vectors into a long representation for classifier training and testing, while late fusion trains a classifier using each feature separately and combines the outputs of all the classifiers. Neither method is capable of explicitly modeling the correlations among the features, which can be exploited to achieve a better representation. In addition, existing categorization methods often neglect the relationships among different semantic classes, which can be exploited to boost the categorization performance. Although several works investigate multi-feature fusion or exploit the inter-class relationships, as will be discussed in the next section, they mostly address the two problems separately.

Motivated by the limitations of the existing techniques and the increasing popularity of using Deep Neural Networks (DNN) for visual categorization, in this paper we propose a novel unified framework that jointly learns the feature relationships and the class relationships using a DNN. Video categorization can also be carried out within the same network utilizing the learned relationships.

Figure 1 gives an overview of the proposed approach. We first extract several popular video features, including frame-based features computed by convolutional neural networks (CNN) [1], trajectory-based motion descriptors, and audio descriptors. The features are then used as the inputs of a DNN, where the first two layers are the input and feature transformation layers, respectively. The third layer is called the fusion layer, where we impose structural regularization on the network weights to identify and utilize the feature relationships. Specifically, the regularization terms are selected based on two natural properties of the inter-feature relationships: correlation and diversity. The former means that different features may share some common patterns in a middle-level representation lying between the original features and the high-level semantics (i.e., the transformed features after the second layer). The latter emphasizes the unique characteristics of different features, which provide complementary information that is likely to be useful for identifying video semantics. By modeling both properties using a feature correlation matrix, we impose a trace-norm regularization over the network weights to reveal the hidden correlations and diversity of the features.

Fig. 1. Illustration of the proposed rDNN framework for video categorization. Various visual and audio features are first extracted and then input into the rDNN. The features are first transformed separately using one layer of neurons. On the fusion layer, regularization on the network parameters is imposed to ensure that different features can share correlated dimensions while preserving some unique characteristics. As shown in the figure, some dimensions of different features are correlated (indicated by the thick lines pointing to the same neuron). After that, the parameters between the fusion and the output layers are also regularized to discover groups of categories. Both the learned feature and class relationships are useful for improving the final performance.

In order to discover and utilize the inter-class relationships, we impose similar regularizations on the weights of the final output layer. This helps to identify the grouping structures of video classes and the outlier classes. Semantic classes within the same group share commonalities that can be utilized as knowledge sharing for improved categorization performance, while the outlier classes should be excluded from "negative" knowledge sharing. By executing regularized learning of the two kinds of relationships within the same DNN, we arrive at a unified framework, namely regularized DNN (rDNN), which can be easily implemented using a modern GPU.

Although the framework shown in Figure 1 is built on the static CNN feature and a few typical hand-crafted video features, our approach can be used to fuse any features. We also realize that, in the image categorization domain, the CNN features dominate state-of-the-art approaches. The reasons for considering both the CNN feature and the hand-crafted features in this work are twofold. First, the hand-crafted features have been widely used for video categorization and remain the key components of many systems that generated recent state-of-the-art results on tasks like human action recognition [2] and complex event recognition [3]. By using these features it is easy to make comparisons with the traditional approaches. Second, so far, no existing work on neural network based video feature extraction has demonstrated significantly better results than the traditional hand-crafted features. Two recent works only showed lower or similar results [5], [6]. Therefore, this paper considers both the deeply learned and the hand-crafted features, and focuses on the tasks of feature fusion and semantic categorization.

The main contribution of this paper is the proposal of the rDNN. To the best of our knowledge, rDNN is the first attempt to exploit both the feature and the class relationships in the DNN pipeline for video categorization. Our formulation is designed to model complex relationships such as feature/class correlation and diversity. It is easy to implement and can be efficiently executed using a GPU. In addition, recognizing that the existing datasets for video categorization are small (e.g., the UCF101 IZl) or lack accurate manual labels (e.g., the DeepSports 13), we introduce and release a new benchmark dataset, called Fudan-Columbia Video Dataset (FCVID). FCVID contains 91,223 YouTube videos and 239 manually annotated categories. It is one of the largest manually annotated datasets of Internet videos. We evaluate rDNN using this new dataset, and hope that its public release will stimulate future research around this challenging problem. This work extends a conference publication |3 by adding more detailed discussions throughout the paper, introducing the FCVID dataset, conducting new experiments with the CNN feature, and comparing with more alternative methods.

The rest of this paper is organized as follows. Section 2 discusses related works, where we mainly focus on the existing works exploiting feature or class relationships. Section 3 elaborates the proposed rDNN approach. Extensive experimental results and comparisons with several baseline methods and recent state-of-the-art approaches are discussed in Section 4, where we also provide a brief introduction of the new FCVID dataset. Finally, Section 5 concludes this paper.

2 Related Work 

Video categorization has received significant research attention. Most approaches follow a standard pipeline, where various features are first extracted and then used as inputs of classifiers. Many works have focused on the design of novel features, such as the Spatial-Temporal Interest Points (STIP) |5l, trajectory-based descriptors [2], audio clues [10], and the Convolutional Neural Networks (CNN) based features HI, El, im, 0.

In contrast to the variety of video features, Support Vector Machines (SVM) have been the dominant classifier option for over a decade. Recently, with the increasing popularity of deep learning based approaches, neural networks have also been adopted for video classification [5], lUTl, [6]. Among them, the best deep learning based video categorization result was probably from Simonyan and Zisserman [6], who used a two-stream CNN approach to extract features from static frames and motion optical flow, respectively. The features were classified separately and the predictions were then linearly fused. Using this pipeline, they reported similar performance to the improved dense trajectories [2], one of the best hand-crafted feature-based approaches. Besides accuracy, efficiency is another important factor that should be considered in the design of a modern video classification system. Several recent studies investigated this issue by proposing efficient classification methods [12], [13] or parallel computing strategies Ell, EH-.

In the following we primarily discuss works on multi-feature fusion and/or exploiting class relationships, which are more closely related to this work.

2.1 Exploiting Feature Relationships

In most state-of-the-art video categorization systems, two naive feature fusion strategies were adopted, i.e., early fusion and late fusion. Although neither method can exploit hidden feature relationships such as the correlations among different feature dimensions, both are widely used due to their simplicity and good generalizability. Fusion weights are needed in both methods to weigh the importance of each individual feature dimension, which can be set to equal values (a.k.a. average fusion) or learned by cross validation. In several recent works, multiple kernel learning (MKL) E^ was adopted to estimate the fusion weights EtI, ESl. MKL was reported to produce better performance in some cases, but the gain was also often observed to be insignificant [19].
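
As a minimal illustration of the two baseline strategies (not the implementation of any cited system), the Python sketch below contrasts early fusion, which concatenates per-feature vectors before classification, with late fusion, which combines per-feature classifier scores using fusion weights. The function names, the toy vectors, and the scores are hypothetical.

```python
from typing import List, Optional, Sequence


def early_fusion(features: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate all per-feature vectors into one long representation."""
    fused: List[float] = []
    for vec in features:
        fused.extend(vec)
    return fused


def late_fusion(scores: Sequence[float],
                weights: Optional[Sequence[float]] = None) -> float:
    """Combine per-feature classifier scores with fusion weights.

    With weights=None every feature gets the same weight (average fusion).
    """
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("one weight per score is required")
    return sum(w * s for w, s in zip(weights, scores))


# Toy inputs (hypothetical): a 3-d visual vector and a 2-d audio vector.
visual, audio = [0.2, 0.7, 0.1], [0.9, 0.4]
long_vector = early_fusion([visual, audio])   # one 5-d input for a single classifier
avg_score = late_fusion([0.8, 0.4])           # average of two classifier outputs
```

In both cases the features interact only through concatenation or a weighted sum of scores, which is why neither strategy models dimension-wise correlations.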

Several more advanced feature fusion approaches were recently proposed. In [20], Ye et al. proposed an optimization framework, called robust late fusion, which uses a shared low-rank matrix to remove noise in certain feature modalities. This requires iteratively computing the singular value decomposition, and is therefore less scalable for large-scale applications in high-dimensional spaces. In another work by Liu et al. [21], dynamic fusion was adopted to identify the best feature combination strategy for each sample. This approach was shown to be effective but is extremely expensive. In [22], Jiang et al. focused on exploiting the correlations between audio and visual features. They proposed to generate an audio-visual joint codebook by discovering the correlations of the two features for video classification. The approach represents a promising direction as it is one of the very few works performing deep exploitation of feature correlations. The visual features used in this work, however, were computed on the segmented patches of video frames, which is computationally expensive as segmentation is not an easy task. The work was further extended in [23], where the temporal interaction of the audio and visual features was exploited. Jhuo et al. [24] also followed a similar framework, and improved the speed of training the audio-visual codebook by replacing the segmentation-based region features with standard local features like SIFT [25].

With the growing popularity of the DNN, a few recent studies focused on combining multiple features in neural networks, which are closely related to this work. A deep denoising auto-encoder was employed in [26] to learn a shared representation based on multimodal inputs. Similarly, a deep Boltzmann machine was utilized in [27] to fuse visual and textual features. Very recently, Kihyuk et al. [28] proposed to learn a good shared representation by minimizing the variation of information, so that a missing input modality can be better predicted based on the available information. They showed that this method outperforms [27] on several image classification benchmarks. Different from [26], [27], which fused the features in a "free" way without imposing any learning or optimization process, in this paper we propose regularized fusion of multiple features, which is intuitively reasonable and empirically effective. Compared with [28], our objective is to identify dimension-wise feature correlations. Minimizing the variation of information as in [28] might be more suitable for images, but for videos, different modalities (e.g., audio and visual) may represent very distinctive information, and simply minimizing their variation may not be a good strategy to exploit the complementary information.

2.2 Exploiting Class Relationships 

Many researchers have investigated class relationships, commonly termed context, to improve classification performance. In [29], Torralba et al. discussed the importance of context in the task of object detection in images. In [30], [31], the class co-occurrence context was utilized to improve object recognition accuracy. For video classification, Jiang et al. [32] proposed a semantic diffusion algorithm to harness the class relationships. The algorithm has the capability of domain adaptation. In other words, it can adjust pre-defined class relationships based on the data distribution of a domain different from that of the training set. Weng et al. [33] proposed a similar domain-adaptive method that not only used the class relationships, but also explored temporal context information of broadcast news videos. Recently, Deng et al. [34] proposed Hierarchy and Exclusion (HEX) graphs, which can capture not only the co-occurrence class relationships, but also mutual exclusion and subsumption. Two other recent works [35], [36] utilized the co-occurrence statistics to help video classification, where the co-occurrence of classes was used more as a semantic feature representation.

Most of these approaches, however, rely on the co-occurrence statistics of the video classes, and thus cannot be used in cases where the classes share certain commonalities but do not explicitly co-occur in the same video. Our approach can automatically learn and utilize such commonalities using a regularized DNN with a rigorous formulation.

Our formulation is partly inspired by recent research on Multiple Task Learning (MTL) [37], [38]. MTL trains multiple class models simultaneously and boosts the performance of a task (a classifier model) by seeking help from other related tasks. MTL has demonstrated good results in many applications, such as disease prediction [39], [40] and financial stock selection [41]. Sharing certain commonalities among multiple tasks is the key idea of MTL, and several algorithms have been proposed with regularizations on the shared patterns across tasks [42], [43], [44]. These works exploited the class relationships in classification or regression problems using conventional learning approaches, but never injected such regularizations into the DNN.

In fact, the neural network is one of the earliest MTL models [45]. See Figure 2(b) for a standard network structure. In the network, each unit of the output layer refers to a task (class) and neurons of the hidden layers can be viewed as the shared common features. In this paper, we show that, by imposing explicit forms of regularization, the class relationships can be better exploited for video classification, and thus superior performance over the traditional neural network with implicit task sharing can be attained.

3 Regularized DNN 

In this section, we elaborate the details of the proposed regularized DNN for video classification. We start by introducing the notations and settings.

3.1 Notations and Settings 

We have a training set with a total of N video samples, which are associated with C semantic classes. Since a video sample may have M types of feature representations (e.g., multiple visual and audio clues), we can use an (M + 1)-tuple to represent each video as:

$$(\mathbf{x}_n^1, \cdots, \mathbf{x}_n^M, \mathbf{y}_n), \quad n = 1, \cdots, N.$$

Here $\mathbf{x}_n^m$ represents the m-th feature of the n-th video sample, and $\mathbf{y}_n = [y_{n1}, \cdots, y_{nc}, \cdots, y_{nC}]^\top$ is the associated semantic label for the n-th sample. If the n-th sample belongs to the c-th semantic class, the c-th element is set as $y_{nc} = 1$, otherwise $y_{nc} = 0$. The objective for video classification under the above setting is to train prediction models that can categorize new test videos into the C semantic classes.
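
The label encoding described above can be sketched in a few lines of Python. The helper name `label_vector` is hypothetical, and classes are 0-indexed in the code, unlike the 1-indexed notation in the text. Nothing in the definition restricts a sample to a single positive class, so the sketch accepts several.

```python
from typing import Iterable, List


def label_vector(positive_classes: Iterable[int], num_classes: int) -> List[int]:
    """Build y_n with y_nc = 1 if sample n belongs to class c, else 0."""
    y = [0] * num_classes
    for c in positive_classes:
        if not 0 <= c < num_classes:
            raise ValueError(f"class index {c} out of range")
        y[c] = 1
    return y


# A sample that belongs to classes 1 and 3 out of C = 5 classes.
y_n = label_vector([1, 3], num_classes=5)
```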

A simple approach is to independently train one classifier for each semantic class, where different features can be combined using either the early or the late fusion scheme. However, such an independent training strategy does not exploit the feature or the class relationships, and it often requires a large number of training samples for each video class. To address these limitations, we propose a DNN framework with structure regularization to perform video classification. In particular, this regularized DNN carries out feature fusion with an additional layer, namely the fusion layer, to exploit the correlation and diversity of multiple features, as illustrated in Figure 1. Furthermore, we impose additional regularization on the prediction layer to enforce knowledge sharing across different semantic classes. With such a regularized DNN framework, we are able to explicitly explore both types of relationships in a unified learning process. To present the details of the proposed regularized DNN, below we first introduce the background of training standard DNNs with a single type of feature.

3.2 DNN Learning with A Single Type of Feature 

Inspired by biological neural systems, a DNN uses a large number of interconnected neurons to construct complex computational models that mimic the information processing in neural systems. By cascading the neurons in multiple layers, a DNN exhibits strong non-linear abstraction capacity and is able to learn an arbitrary mapping from inputs to outputs, provided that sufficient training data are given.

Fig. 2. Popular neural network structures: (a) is the standard one-vs-all training scheme using a single type of feature; (b) is the popular structure used in multi-class learning with a single type of feature; (c) is the one-vs-all training scheme using multiple types of features; and (d) is a recently proposed neural network structure that processes multiple features separately and then performs fusion using a middle layer [27].

Given a DNN with L layers, we denote $\mathbf{a}_{l-1}$ and $\mathbf{a}_l$ as the input and the output of the l-th layer, $l = 1, \cdots, L$, while $\mathbf{W}_l$ and $\mathbf{b}_l$ refer to the weight matrix and the bias vector of the l-th layer, respectively. With only a single type of feature, the transition function from the (l − 1)-th layer to the l-th layer can be written as:

$$\mathbf{a}_l = \begin{cases} \sigma(\mathbf{W}_{l-1}\mathbf{a}_{l-1} + \mathbf{b}_{l-1}), & l > 1; \\ \mathbf{x}, & l = 1, \end{cases} \tag{1}$$

where the nonlinear sigmoid function $\sigma(\cdot)$ is defined as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$


For simplicity, we can absorb $\mathbf{b}_{l-1}$ into the weight coefficients $\mathbf{W}_{l-1}$ by adding an additional dimension to the feature vectors with a constant value of one. Figure 2(a) and (b) show two types of four-layered neural networks using a single feature as the input to classify data samples into C semantic classes.
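
The layer transition and the bias-absorption trick can be illustrated with a short NumPy sketch. It is only a toy under stated assumptions: the layer sizes, the random weights, and the function names are hypothetical, and each weight matrix carries one extra column that plays the role of the bias.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    """Element-wise logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def forward(x: np.ndarray, weights: list) -> np.ndarray:
    """Apply a_l = sigma(W_{l-1} [a_{l-1}; 1]) layer by layer, with a_1 = x.

    Each matrix in `weights` has one extra column holding the bias, so the
    bias is absorbed by appending a constant 1 to the layer input.
    """
    a = x
    for W in weights:
        a_aug = np.append(a, 1.0)      # extra dimension with constant value one
        a = sigmoid(W @ a_aug)
    return a


# Hypothetical four-layer network: 6 -> 4 -> 3 -> 2 units (three weight matrices).
rng = np.random.default_rng(0)
sizes = [6, 4, 3, 2]
weights = [rng.normal(scale=0.1, size=(n_out, n_in + 1))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
output = forward(rng.normal(size=6), weights)
```

Multiplying the augmented input by the augmented matrix gives the same pre-activation as the explicit form with a separate bias vector, which is what allows the bias to be dropped from the notation.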

Typically, one can minimize the following cost function to derive the optimal weights for each layer in the network:

$$\min_{\mathbf{W}} \sum_{i=1}^{N} \ell(f(\mathbf{x}_i), \mathbf{y}_i) + \frac{\lambda_1}{2} \sum_{l=1}^{L-1} \|\mathbf{W}_l\|_F^2. \tag{2}$$

The first part in the above cost function measures the empirical loss on the training data, which summarizes the discrepancy between the outputs of the network $\hat{\mathbf{y}}_i = \mathbf{a}_L = f(\mathbf{x}_i)$ and the ground-truth labels $\mathbf{y}_i$. The second part is a regularization term preventing overfitting.
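
To make the two parts of the cost concrete, the sketch below evaluates an empirical loss plus a Frobenius-norm weight penalty on toy numbers. The squared-error loss is an assumption made only for this example, since the text leaves the loss generic; the function name and all values are hypothetical.

```python
import numpy as np


def regularized_cost(outputs, labels, weights, lam):
    """Empirical loss plus (lam / 2) * sum_l ||W_l||_F^2.

    The squared-error loss is an assumption; the text does not fix the loss.
    """
    outputs = np.asarray(outputs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    empirical_loss = np.sum((outputs - labels) ** 2)
    weight_penalty = sum(np.sum(W ** 2) for W in weights)   # squared Frobenius norms
    return float(empirical_loss + 0.5 * lam * weight_penalty)


# Two samples, two classes, and a single 2 x 2 weight matrix (all hypothetical).
cost = regularized_cost(
    outputs=[[0.9, 0.2], [0.1, 0.7]],
    labels=[[1, 0], [0, 1]],
    weights=[np.array([[1.0, 2.0], [0.0, -1.0]])],
    lam=0.1,
)
```

Here the loss part contributes 0.15 and the penalty part contributes 0.5 × 0.1 × 6 = 0.3, so the total cost is 0.45.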


3.3 Regularization for Feature Fusion 

The DNN using a single type of feature is effective in some cases. However, for data with a variety of representations like videos, the semantics could be carried by different feature representations. For instance, some video semantic classes are strongly linked to visual effects, while others are more relevant to audio clues. Simple fusion strategies like the early or the late fusion usually yield limited performance improvement since the intrinsic relations among multiple feature representations are often overlooked. In addition, such simple fusion methods often incur extra effort for training the classifiers. Therefore, it is desirable to obtain a compact yet effective fused representation that leverages the complementary clues from various features.

Motivated by the multisensory integration process of primary neurons in biological systems [47], [48], we extend the basic DNN with structure regularization on an additional fusion layer to accommodate the deep fusion process of multiple types of features. As demonstrated in Figure 1, the fusion layer absorbs all the outputs from the transformation layer to generate an integrated representation as the input for the classification layer. Accordingly, the transition equation for this fusion layer can be written as follows:

$$\mathbf{a}_F = \sigma\left(\sum_{m=1}^{M} \mathbf{W}_E^m \mathbf{a}_E^m + \mathbf{b}_E\right). \tag{3}$$

We denote E as the index of the last layer of feature transformation and F as the index of the fusion layer (i.e., $F = E + 1$). Hence, $\mathbf{a}_E^m$ represents the extracted mid-level representation for the m-th feature. From the above transition equation, the mid-level representation is first linearly transformed by the weight matrix $\mathbf{W}_E^m$ and then non-linearly mapped to generate the fused representation $\mathbf{a}_F$ using a sigmoid function.
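
The fusion-layer transition of Equation (3) can be sketched as follows. This is a minimal illustration, assuming M = 3 feature streams of equal dimension; the sizes, the random values, and the names `fusion_layer`, `a_E`, `W_E_list`, and `b_E` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def fusion_layer(mid_level, fusion_weights, bias):
    """Compute a_F = sigma(sum_m W_E^m a_E^m + b_E) for M feature streams."""
    z = np.array(bias, dtype=float)     # copy so the caller's bias is not modified
    for W_m, a_m in zip(fusion_weights, mid_level):
        z += W_m @ a_m                  # each stream has its own weight matrix
    return sigmoid(z)


# Hypothetical sizes: M = 3 streams of dimension 4, fused into 5 units.
rng = np.random.default_rng(1)
M, d_E, d_F = 3, 4, 5
a_E = [rng.random(d_E) for _ in range(M)]
W_E_list = [rng.normal(scale=0.1, size=(d_F, d_E)) for _ in range(M)]
b_E = np.zeros(d_F)
a_F = fusion_layer(a_E, W_E_list, b_E)
```

Taken on its own, this sum is the same computation as a dense layer applied to the concatenated mid-level features; what distinguishes the proposed fusion is the regularization placed on these weights during training.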


Since the M feature representations reveal various perspectives of the same video data, it is understandable that all these features can be used to learn the common latent semantic patterns. In addition, different types of features can also be complementary to each other since they have distinct clues and characteristics. Hence, the objective for the fusion process should capture the relations among the features, while being able to preserve their uniqueness. Different from most of the straightforward fusion strategies, we specifically formulate an optimization problem with structure regularization on the weights of the fusion layer. Such a regularized DNN encourages the fusion process to explore correlations and diversities among the multiple features simultaneously.

Note that the weights of the fusion layer, $\mathbf{W}_E^1, \cdots, \mathbf{W}_E^M$, transform all the available features into a shared representation. Here the weight matrices are first vectorized into P-dimensional vectors separately, with $P = |\mathbf{a}_E^m| \cdot |\mathbf{a}_F|$ being the product of the dimension of $\mathbf{a}_E^m$ ($m = 1, \cdots, M$) and the dimension of $\mathbf{a}_F$. To simplify the formulation, we assume the extracted features $\mathbf{a}_E^m$ are of the same dimension. Then all the coefficient vectors are stacked into a matrix $\mathbf{W}_E \in \mathbb{R}^{P \times M}$. Each column of $\mathbf{W}_E$ corresponds to the weights of a single feature with the element $\mathbf{W}_E(i, j)$ given as

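In order to perform feature fusion by exploring correlations and diversities simultaneously, we formulate the following regularized optimization problem to learn the weights of the DNN:

$$\begin{aligned} \min_{\mathbf{W}, \mathbf{\Psi}} \quad & \mathcal{L} + \frac{\lambda_1}{2}\left(\sum_{l=1}^{E}\sum_{m=1}^{M}\|\mathbf{W}_l^m\|_F^2 + \sum_{l=F}^{L-1}\|\mathbf{W}_l\|_F^2\right) + \frac{\lambda_2}{2}\operatorname{tr}\left(\mathbf{W}_E\mathbf{\Psi}^{-1}\mathbf{W}_E^\top\right) \\ \text{s.t.} \quad & \mathbf{\Psi} \succeq 0, \end{aligned} \tag{4}$$

where $\mathcal{L} = \sum_{i=1}^{N} \ell(f(\mathbf{x}_i), \mathbf{y}_i)$ is the empirical loss term.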

Different from the standard single-feature based neural network (cf. Equation (2)), we include one additional regularization term in the above cost function with one more variable $\mathbf{\Psi} \in \mathbb{R}^{M \times M}$ to model the inter-feature correlation. Note that $\mathbf{\Psi}$ is a symmetric and positive semidefinite matrix, and the last regularization term with the trace norm can help learn the inter-feature relationship [38], [49]. The entries with large values in $\mathbf{\Psi}$ indicate strong feature correlations, while small-valued entries denote the diversity among different features since they are less correlated. The coefficients $\lambda_1$ and $\lambda_2$ balance the contributions from different regularization terms. Finally, we can introduce a joint minimization procedure over the weight matrix $\mathbf{W}$ and the feature correlation matrix $\mathbf{\Psi}$ to train the regularized DNN.
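
The stacking of the fusion weights and the trace term of the fusion objective can be sketched in NumPy as follows. The sizes and the helper names `stack_fusion_weights` and `trace_regularizer` are hypothetical; the feature correlation matrix is called `Psi` in the code, and a uniform diagonal matrix is used only as a simple feasible example.

```python
import numpy as np


def stack_fusion_weights(fusion_weights):
    """Vectorize each per-feature weight matrix and stack as columns (P x M)."""
    return np.stack([np.asarray(W_m).ravel() for W_m in fusion_weights], axis=1)


def trace_regularizer(W_E, Psi):
    """Evaluate tr(W_E Psi^{-1} W_E^T) via a linear solve (no explicit inverse)."""
    # solve(Psi, W_E.T) is Psi^{-1} W_E^T (M x P); the element-wise product
    # with W_E.T summed over all entries equals the trace.
    return float(np.sum(W_E.T * np.linalg.solve(Psi, W_E.T)))


# Hypothetical setup: M = 3 features, each with a 2 x 2 fusion weight matrix (P = 4).
rng = np.random.default_rng(4)
fusion_weights = [rng.normal(size=(2, 2)) for _ in range(3)]
W_E = stack_fusion_weights(fusion_weights)

Psi_uniform = np.eye(3) / 3.0                     # no cross-feature correlation, trace 1
penalty = trace_regularizer(W_E, Psi_uniform)     # equals 3 * ||W_E||_F^2
```

With a diagonal correlation matrix the term reduces to a scaled Frobenius penalty, so it is the off-diagonal entries that couple the weight columns of different features.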


3.4 Regularization for Class Knowledge Sharing

As discussed earlier, one can simply adopt the one-vs-all strategy to independently train C classifiers for categorizing video semantics. As illustrated in Figure 2(a) and 2(c), this one-vs-all training scheme with a total of C four-layered neural networks can be applied to both single-feature and multi-feature settings. To train a total of C neural networks separately, a sufficient number of positive training samples is desired for each video category. In addition, the independent training process completely neglects the knowledge sharing among different semantic categories. However, video semantics often share some commonality due to the strong correlations between different semantic categories, which have been observed in previous studies [32], [50], [51]. Therefore, it is critical to explore such commonality by simultaneously learning multiple video semantics, which can lead to better learning performance [51]. Generally, the commonality among multiple classes is represented by parameter sharing among different prediction models [52], [53]. In addition, it is fairly natural for a DNN to perform simultaneous multi-class training. For example, as seen in Figure 2(b), by adopting a set of C units in the output layer, a single-feature based DNN can be easily extended to multi-class problems.

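Motivated by the regularization methods adopted for MTL [52], [53], here we present a regularized DNN that aims at training multiple classifiers simultaneously with deeper exploitation of the class relationships. To enforce class knowledge sharing, we employ the following optimization problem as our learning objective:

$$\begin{aligned} \min_{\mathbf{W}, \mathbf{\Omega}} \quad & \sum_{i=1}^{N}\ell(f(\mathbf{x}_i), \mathbf{y}_i) + \frac{\lambda_1}{2}\sum_{l=1}^{L-1}\|\mathbf{W}_l\|_F^2 + \lambda_2\operatorname{tr}\left(\mathbf{W}_{L-1}\mathbf{\Omega}^{-1}\mathbf{W}_{L-1}^\top\right) \\ \text{s.t.} \quad & \mathbf{\Omega} \succeq 0. \end{aligned} \tag{5}$$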

Although some previous MTL works explore similar regularization in the learning objective, they often assume that the class relationships are explicitly given and are ready for use as prior knowledge [53], [38]. In our formulation, we learn the prediction model as well as the class relationships. In particular, we adopt a convex formulation by imposing a trace norm regularization term over the coefficients of the output layer $\mathbf{W}_{L-1}$ with the class relationships augmented as a matrix variable $\mathbf{\Omega} \in \mathbb{R}^{C \times C}$. The constraint $\mathbf{\Omega} \succeq 0$ indicates that the class relationship matrix is positive semidefinite since it can be viewed as the similarity measure of the semantic classes. The coefficients $\lambda_1$ and $\lambda_2$ are regularization parameters that balance the contributions from different regularization terms.

3.5 Final Objective of rDNN

Considering both objectives of feature fusion and class knowledge sharing, we now present a unified DNN formulation that is able to explore both the feature and the class relationships. In this joint framework, one additional layer is employed to fuse multiple features, where the objective is to bridge the gap between low-level features and the high-level video semantics. Then another layer of neurons is stacked over the fusion layer to generate the predictions, where we impose the trace norm regularization over the prediction models to encourage knowledge sharing across different semantic categories. To build such an rDNN, we incorporate both the feature regularization in Equation (4) and the class regularization in Equation (5) to form the following objective:


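$$\begin{aligned} \min_{\mathbf{W}, \mathbf{\Psi}, \mathbf{\Omega}} \quad & \mathcal{L} + \frac{\lambda_1}{2}\left(\sum_{l=1}^{E}\sum_{m=1}^{M}\|\mathbf{W}_l^m\|_F^2 + \sum_{l=F}^{L-1}\|\mathbf{W}_l\|_F^2\right) + \frac{\lambda_2}{2}\operatorname{tr}\left(\mathbf{W}_E\mathbf{\Psi}^{-1}\mathbf{W}_E^\top\right) + \frac{\lambda_3}{2}\operatorname{tr}\left(\mathbf{W}_{L-1}\mathbf{\Omega}^{-1}\mathbf{W}_{L-1}^\top\right) \\ \text{s.t.} \quad & \mathbf{\Psi} \succeq 0, \quad \operatorname{tr}(\mathbf{\Psi}) = 1, \\ & \mathbf{\Omega} \succeq 0, \quad \operatorname{tr}(\mathbf{\Omega}) = 1, \end{aligned} \tag{6}$$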


where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are regularization parameters. In the above formulation, two trace-norm regularization terms are tailored for the fusion of multiple features and the exploitation of the class relationships, respectively. In addition, we impose two additional constraints $\operatorname{tr}(\mathbf{\Psi}) = 1$ and $\operatorname{tr}(\mathbf{\Omega}) = 1$ to restrict the complexity, as suggested in [38]. In the next section, we introduce an alternating optimization strategy to minimize the above cost function with respect to the network weights $\{\mathbf{W}_l\}_{l=1}^{L}$, the feature relationship matrix $\mathbf{\Psi}$, as well as the class correlation matrix $\mathbf{\Omega}$.


3.6 Optimization and Analysis 

For the optimization problem in Equation (6), two pairs of variables, i.e., $(\mathbf{W}_E, \mathbf{\Psi})$ and $(\mathbf{W}_{L-1}, \mathbf{\Omega})$, are coupled with each other. Here we adopt an alternating optimization approach to iteratively minimize the cost function with respect to $\mathbf{W}_l^m$ ($l = 1, \cdots, L$, $m = 1, \cdots, M$), $\mathbf{\Psi}$, and $\mathbf{\Omega}$.

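We first consider the minimization problem over the network weight matrix $\mathbf{W}_l^m$ with fixed $\mathbf{\Psi}$ and $\mathbf{\Omega}$. It is easy to see that the original problem reduces to the following unconstrained optimization problem:

$$\min_{\mathbf{W}} \ \mathcal{L} + \frac{\lambda_1}{2}\left(\sum_{l=1}^{E}\sum_{m=1}^{M}\|\mathbf{W}_l^m\|_F^2 + \sum_{l=F}^{L-1}\|\mathbf{W}_l\|_F^2\right) + \frac{\lambda_2}{2}\operatorname{tr}\left(\mathbf{W}_E\mathbf{\Psi}^{-1}\mathbf{W}_E^\top\right) + \frac{\lambda_3}{2}\operatorname{tr}\left(\mathbf{W}_{L-1}\mathbf{\Omega}^{-1}\mathbf{W}_{L-1}^\top\right). \tag{7}$$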


Since all the terms in the above cost function are smooth, the gradient can be easily evaluated. Let $\mathbf{G}_l^m$ be the gradient with respect to $\mathbf{W}_l^m$. We have the following update equation for the weight matrix $\mathbf{W}_l^m$:

$$\mathbf{W}_l^m = \mathbf{W}_l^m - \eta\,\mathbf{G}_l^m, \tag{8}$$

where $\eta$ is the step length of the gradient descent.
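
The gradient step needs the derivative of every term in the cost. For a trace term of the form tr(W Psi^{-1} W^T) with a symmetric Psi, a standard matrix-calculus identity gives the gradient 2 W Psi^{-1}. The sketch below checks this identity numerically and applies one descent step. The matrix sizes, the random values, and the function names are hypothetical, and the loss term is left out so that only the trace term is exercised.

```python
import numpy as np


def trace_term(W, Psi):
    """tr(W Psi^{-1} W^T) for a P x M matrix W and an M x M matrix Psi."""
    return float(np.sum(W.T * np.linalg.solve(Psi, W.T)))


def trace_term_gradient(W, Psi):
    """Analytic gradient 2 W Psi^{-1}; relies on Psi being symmetric."""
    return 2.0 * np.linalg.solve(Psi, W.T).T


def numerical_gradient(f, W, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        step = np.zeros_like(W)
        step[idx] = eps
        grad[idx] = (f(W + step) - f(W - step)) / (2.0 * eps)
    return grad


rng = np.random.default_rng(2)
W = rng.normal(size=(6, 3))
B = rng.normal(size=(3, 3))
Psi = B @ B.T + np.eye(3)          # symmetric positive definite
Psi /= np.trace(Psi)               # trace-one normalization

G = trace_term_gradient(W, Psi)
G_check = numerical_gradient(lambda V: trace_term(V, Psi), W)

eta = 1e-3
W_next = W - eta * G               # one gradient descent step on the trace term
```

In a full network the same term would simply be added to the back-propagated gradient of the loss and of the Frobenius penalties before the update is applied.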

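We then introduce the solution for minimizing the cost function over $\mathbf{\Psi}$ with the other variables fixed. The problem in Equation (6) can be rewritten as:

$$\begin{aligned} \min_{\mathbf{\Psi}} \quad & \operatorname{tr}\left(\mathbf{W}_E\mathbf{\Psi}^{-1}\mathbf{W}_E^\top\right), \\ \text{s.t.} \quad & \mathbf{\Psi} \succeq 0, \quad \operatorname{tr}(\mathbf{\Psi}) = 1. \end{aligned} \tag{9}$$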


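By adopting the Cauchy-Schwarz inequality, we obtain the analytical solution for the above optimization problem as:

$$
\Psi = \frac{\left(W_E^{\top}W_E\right)^{\frac{1}{2}}}{\mathrm{tr}\!\left(\left(W_E^{\top}W_E\right)^{\frac{1}{2}}\right)}. \tag{10}
$$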


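Similarly, we can derive the optimal solution for $\Omega$ as:

$$
\Omega = \frac{\left(W_{L-1}^{\top}W_{L-1}\right)^{\frac{1}{2}}}{\mathrm{tr}\!\left(\left(W_{L-1}^{\top}W_{L-1}\right)^{\frac{1}{2}}\right)}. \tag{11}
$$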


Note that Zhang et al. adopted a similar solution to identify task correlations for a linear kernel based regression and classification problem [38]. However, our method integrates more complex structure regularizations in a neural network architecture, where both the feature and the class relationships are exploited for a completely different application.
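The closed-form updates can be sketched in a few lines of NumPy. This is an illustrative implementation, assuming the relationship matrix is the normalized matrix square root of $W^{\top}W$; the square root is computed through an eigendecomposition, and the function names are hypothetical.

```python
import numpy as np

def psd_sqrt(A):
    """Square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(A)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def relationship_update(W):
    """Return (W^T W)^{1/2} / tr((W^T W)^{1/2})."""
    S = psd_sqrt(W.T @ W)
    return S / np.trace(S)

def trace_objective(W, R):
    """tr(W R^{-1} W^T), the quantity minimized over R."""
    return float(np.trace(W @ np.linalg.solve(R, W.T)))
```

At the returned solution the objective equals $\big(\mathrm{tr}((W^{\top}W)^{1/2})\big)^2$, which by the Cauchy-Schwarz inequality is never larger than the value obtained with the scaled identity initialization.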

In summary, we first estimate the feature and class relationships using the weights in the neural network. The relationship matrices are then utilized in turn to refine the network weights to improve the classification performance. Due to the existence of analytical solutions, we are able to learn the relationship matrices $\Psi$ and $\Omega$ efficiently. Finally, the training procedure of the proposed rDNN is summarized in Algorithm 1. In each epoch, we need to compute the gradient matrix $G_l^m$ for updating $W_l^m$, and then update the matrices $\Omega$ and $\Psi$. The complexity of calculating the trace norms is the same as that of the $\ell_2$ norm. The update of $\Omega$ and $\Psi$ requires cubic-complexity operations with respect to the number of features $M$ and the number of video classes $C$. In practical large scale settings, the values of $M$ and $C$ are often significantly smaller than the number of training samples. Therefore, the training cost of the proposed rDNN is very similar to that of a standard DNN. Our empirical study further confirms the efficiency of our method, as will be discussed later.




Algorithm 1 Training Procedure of rDNN.

Require: $x_n^m$: the representation of the m-th feature for the n-th video sample;
         $y_n$: the semantic label of the n-th video sample;

1: Initialize $W_l^m$ randomly, $\Psi = \frac{1}{M} I_M$ and $\Omega = \frac{1}{C} I_C$, where $I_M$ and $I_C$ are identity matrices;
2: for epoch = 1 to K do
3:    Back propagate the prediction error from layer L to layer 1 by evaluating the gradient $G_l^m$, and update the weight matrix for each layer and each feature as:
         $W_l^m = W_l^m - \eta\, G_l^m$;
4:    Update the feature relationship matrix $\Psi$ according to Equation (10):
         $\Psi = \frac{(W_E^{\top}W_E)^{1/2}}{\mathrm{tr}((W_E^{\top}W_E)^{1/2})}$;
5:    Update the class relationship matrix $\Omega$ according to Equation (11):
         $\Omega = \frac{(W_{L-1}^{\top}W_{L-1})^{1/2}}{\mathrm{tr}((W_{L-1}^{\top}W_{L-1})^{1/2})}$;
6: end for
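The alternating procedure can be illustrated on a deliberately simplified toy model. The sketch below is not the full rDNN: it assumes a single softmax layer with only the class-relationship regularizer, full-batch gradients instead of mini-batches, and a small ridge term `eps` (an added assumption) to keep the inverse of $\Omega$ well conditioned. All names and default values are hypothetical.

```python
import numpy as np

def psd_sqrt(A):
    """Square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    P = np.exp(Z)
    return P / P.sum(axis=1, keepdims=True)

def train_toy(X, Y, epochs=50, eta=0.1, lam1=1e-4, lam3=1e-4,
              eps=1e-6, seed=0):
    """Alternate a gradient step on W with the closed-form update of Omega."""
    rng = np.random.default_rng(seed)
    d, C = X.shape[1], Y.shape[1]
    W = 0.01 * rng.standard_normal((d, C))   # random initialization
    Omega = np.eye(C) / C                    # scaled identity initialization
    history = []
    for _ in range(epochs):
        P = softmax(X @ W)
        history.append(-np.mean(np.sum(Y * np.log(P + 1e-12), axis=1)))
        # Gradient of cross-entropy plus the two regularizers.
        G = X.T @ (P - Y) / X.shape[0]
        G += lam1 * W + lam3 * W @ np.linalg.inv(Omega + eps * np.eye(C))
        W = W - eta * G
        # Closed-form relationship update from the current weights.
        S = psd_sqrt(W.T @ W)
        Omega = S / np.trace(S)
    return W, Omega, history
```

The returned `history` holds the cross-entropy before each update, which is convenient for checking that the alternating scheme is making progress.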


4 Experiments 

4.1 Experimental Setup 

4.1.1 Dataset and Evaluation

We adopt three challenging datasets to evaluate the 
rDNN, as described in the following. 

Hollywood2 [9]. The Hollywood2 dataset is well-known in the area of human action recognition in videos. Collected from 69 Hollywood movies, it contains 1,707 short video clips annotated according to 12 classes: answering phone, driving car, eating, fighting, getting out of car, hand shaking, hugging, kissing, running, sitting down, sitting up and standing up. Following [9], the dataset is split into a training set with 823 videos and a test set with 884 videos.

Columbia Consumer Videos (CCV) [54]. The CCV dataset is a popular benchmark on Internet consumer video categorization. It contains 9,317 videos collected from YouTube with annotations of 20 semantic categories, including objects (e.g., "cats"), scenes (e.g., "playground"), and events (e.g., "parade"). Since many categories are events, it requires a joint use of multiple feature clues like visual and audio representations to perform better categorization. The dataset is evenly split into a training set and a test set.

Fudan-Columbia Video Dataset (FCVID). Since both the Hollywood2 and the CCV datasets are small in terms of the number of annotated classes and the number of videos, we collect and release a new benchmark, named FCVID¹, to evaluate our rDNN more thoroughly. This dataset contains 91,223 Internet videos annotated manually according to 239 categories, covering a wide range of topics like social events (e.g., "tailgate party"), procedural events (e.g., "making cake"), objects (e.g., "panda"), scenes (e.g., "beach"), etc. We divide the dataset evenly with 45,611 videos for training and 45,612 videos for testing. To the best of our knowledge, FCVID is one of the largest datasets for video categorization with accurate manual annotations. Please refer to the appendix for more information on the dataset, including details on the collection and annotation process, statistics, a category hierarchy, as well as other related released resources (e.g., all the computed features used in this work).

1. Available at: http://bigvid.fudan.edu.cn/FCVID/

For all three datasets, we adopt average precision (AP) to measure the performance of each category and report mean AP (mAP) as the overall result over all the categories.
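For reference, one common way to compute these measures is sketched below. The text does not state which AP variant is used, so this assumes the standard non-interpolated AP (the mean of the precision values at the rank of each positive sample); the function names are illustrative.

```python
import numpy as np

def average_precision(scores, labels):
    """Non-interpolated AP for one category (labels are 0/1)."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    rel = np.asarray(labels)[order] > 0
    if not rel.any():
        return 0.0
    hits = np.cumsum(rel)                 # positives retrieved so far
    ranks = np.arange(1, len(rel) + 1)
    return float(np.mean(hits[rel] / ranks[rel]))

def mean_average_precision(score_matrix, label_matrix):
    """Mean of per-category APs; columns index categories."""
    S, Y = np.asarray(score_matrix), np.asarray(label_matrix)
    return float(np.mean([average_precision(S[:, c], Y[:, c])
                          for c in range(S.shape[1])]))
```

For example, with scores (0.9, 0.8, 0.7, 0.6) and labels (1, 0, 1, 0), the positives sit at ranks 1 and 3, giving AP = (1/1 + 2/3) / 2.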

4.1.2 Video Features 

As aforementioned, we consider both deeply learned 
features and hand-crafted features in this work. 

Static CNN Features. Recently, CNN has exhibited top-notch performance in various visual categorization tasks, particularly in the image domain [55]. We adopt a CNN model pre-trained on the ImageNet 2012 Challenge data, which consists of 1.2 million images and 1,000 concept categories. For a given video frame, we extract a 4,096-d feature representation (CNN-fc7), which is the output of the 7th fully connected layer as suggested in [56]. Finally, the frame-level features are averaged to generate a video-level representation.

Motion Trajectory Features [2]. The dense trajectory features [2], which have exhibited strong performance on various video categorization datasets, have been popular for several years. Densely sampled local frame patches are first tracked over time to generate the dense trajectories. For each trajectory, four descriptors are computed based on the local motion pattern and the appearance around the trajectory, including a 30-d trajectory shape descriptor, a 96-d histogram of oriented gradients (HOG) descriptor, a 108-d histogram of optical flow (HOF) descriptor, and a 108-d motion boundary histogram (MBH) descriptor. Finally, each type of descriptor is quantized into a 4,000-d bag-of-words representation, following the settings of [2].
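The quantization step can be sketched as a hard assignment of each local descriptor to its nearest codeword followed by counting. This is a generic illustration: the codebook is assumed to be given (how it is learned is not described here), and the function name is hypothetical.

```python
import numpy as np

def bag_of_words(descriptors, codebook):
    """Histogram of nearest-codeword assignments.

    descriptors: (n, d) local descriptors of one video.
    codebook:    (k, d) codewords, e.g. k = 4000 for the trajectory features.
    """
    # Squared Euclidean distances between all descriptors and codewords.
    d2 = (np.sum(descriptors ** 2, axis=1)[:, None]
          - 2.0 * descriptors @ codebook.T
          + np.sum(codebook ** 2, axis=1)[None, :])
    words = np.argmin(d2, axis=1)
    return np.bincount(words, minlength=len(codebook)).astype(float)
```

The same routine applies to each descriptor type separately, yielding one histogram per type.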

Audio Features. The audio soundtracks contain very useful clues that can help categorize some video semantics. Two types of audio features are considered in this work. The first one is the popular MFCCs (Mel-Frequency Cepstral Coefficients), which are computed over every 32 ms time window with 50% overlap and then quantized into a bag-of-words representation. The second one is called Spectrogram SIFT (sgSIFT) [57], where we transform the 1-D soundtrack of a video into a 2-D image based on the constant-Q spectrogram. Standard SIFT descriptors are extracted from this spectrogram and quantized into a bag-of-words representation.

All the bag-of-words representations are normalized with RootSIFT [58], which has been shown to be more suitable for histogram-based features than the conventional L2 normalization. The CNN-based representation is directly used without further normalization.
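The RootSIFT-style normalization of a histogram amounts to L1 normalization followed by an element-wise square root. A minimal sketch, with a hypothetical function name and a small `eps` added only to avoid division by zero for empty histograms:

```python
import numpy as np

def root_normalize(hist, eps=1e-12):
    """L1-normalize a non-negative histogram, then take the square root."""
    h = np.asarray(hist, dtype=float)
    h = h / (np.abs(h).sum() + eps)
    return np.sqrt(h)
```

A useful side effect is that the result has unit L2 norm, so a linear kernel on the normalized vectors corresponds to the Hellinger kernel on the original histograms.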

4.1.3 Alternative Approaches for Comparison 
To verify the effectiveness of our rDNN, we compare 
with the following approaches: 

1) DNN. The same structure as the rDNN, but no regularization is imposed.

2) Early Fusion with Neural Networks (NN-EF). All the features are concatenated into a long vector and then used as the input to train a neural network for video categorization.

3) Late Fusion with Neural Networks (NN-LF). A neural network is trained using each feature representation independently. The outputs of all the networks are fused to obtain the final categorization results.

4) Early Fusion with SVM (SVM-EF). The popular kernel SVM is adopted and the features are combined on the kernel level before classification.

5) Late Fusion with SVM (SVM-LF). An SVM classifier is trained for each feature and prediction results are then combined.

6) Multiple Kernel Learning (SVM-MKL). We perform feature fusion with the $\ell_p$-norm MKL [59] by fixing p = 2. MKL is able to learn dynamic fusion weights. For the above EF/LF approaches 2-5, we adopt equal fusion weights.

7) Multimodal Deep Boltzmann Machines (M-DBM). M-DBM is a fusion approach proposed in [27], where multiple feature representations are used as the inputs of the Deep Boltzmann Machines.

8) Discriminative Model Fusion (DMF). DMF [60] is one of the earliest approaches for exploiting the inter-class relationships. It simply uses the outputs of an initial classifier, e.g., a DNN in our case, as the features to train an SVM model as the second level classifier to generate the final prediction. The second level SVM is expected to be able to learn and use the class relationships.

9) Domain Adaptive Semantic Diffusion (DASD). DASD [32] uses a graph diffusion formulation to utilize the inter-class relationships for visual categorization. Similar to DMF, the prediction outputs of a DNN (without the regularizations) are used as the inputs of the DASD in a post-processing refinement step. The approach requires inputs of pre-computed class correlations, which can be estimated based on statistics of label co-occurrences in the training data. Notice that the pre-computed class correlations are not needed by our rDNN, which can automatically learn the relationships.
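The equal-weight fusion strategies used by the early and late fusion baselines can be sketched as follows. This is a schematic illustration only: classifier training is omitted, and the function names are hypothetical.

```python
import numpy as np

def early_fusion(features):
    """Concatenate per-feature matrices (each n x d_m) into one long vector per video."""
    return np.concatenate(features, axis=1)

def kernel_early_fusion(kernels):
    """Combine per-feature kernel matrices (each n x n) with equal weights."""
    return np.mean(np.stack(kernels, axis=0), axis=0)

def late_fusion(score_list):
    """Average the prediction scores (each n x C) of per-feature classifiers."""
    return np.mean(np.stack(score_list, axis=0), axis=0)
```

Early fusion lets a single classifier see all features at once, while late fusion only combines decisions, so neither models the feature relationships explicitly.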


Among the alternative approaches, 2-7 focus on feature fusion, while the last two focus on the use of the class relationships. All the neural network based experiments are conducted on a single NVIDIA Tesla K20 5GB GPU with the MATLAB Parallel Computing Toolbox.

4.2 Results and Discussion 

We now report and discuss experimental results. In order to understand the individual contributions of exploiting the feature relationships and the class relationships, we first test the performance of the rDNN by disabling the regularizations on the output layer and the fusion layer, respectively. This also ensures fair comparisons with the alternative approaches. After that, we enable regularizations on both layers and report results of the entire rDNN framework. With this setting, we analyze the effect of the number of training samples, and compare with recent state-of-the-art results. Last, we discuss the computational efficiency of rDNN.

Throughout the experiments, we set the learning 
rate of the neural networks to 0.7, fix $\lambda_1$ to a small
value of 3 X 10“^ in order to prevent overfitting, and 
tune $\lambda_2$ and $\lambda_3$ in the same range as $\lambda_1$. We adopt mini-batch gradient descent with a batch size of 70 for network training.

4.2.1 Exploiting Feature Relationships 
We first report results by only using the fusion layer regularization in our rDNN, namely rDNN-Feature Regularization (rDNN-F). Table 1 shows the results of the individual features, our rDNN-F, and the alternative feature fusion methods. Among the static CNN, motion and audio features, motion is significantly better than the other two on Hollywood2 but is slightly worse than the CNN feature on CCV and FCVID. This is due to the fact that many classes in CCV and FCVID (e.g., "baseball" and "desert") can be recognized by viewing just one or a few discrete frames, but categorizing the Hollywood2 human actions normally requires a sequence of frames with detailed motion clues. In addition, the overall performance on CCV is slightly lower than that on the much larger FCVID. This is because CCV has some highly correlated categories (cf. Figure 5) that are very difficult to separate. While FCVID also contains similar confusing categories, the percentage of such "difficult" cases is lower as it also has more "easy" categories, and therefore the overall performance is higher.

For the fusion of the three types of features, our rDNN-F achieves the best performance with consistent gains over all the compared methods. Note that, like the "DNN" baseline, the M-DBM approach also utilizes a neural network for feature fusion, but in a free manner without explicitly enforcing the use of the feature relationships. These results clearly verify the effectiveness of imposing the proposed fusion regularization method. Notice that, since the Hollywood2 and the CCV datasets have been widely used, an absolute mAP gain of 2% is generally considered a significant improvement.

TABLE 1
Performance comparison (mAP) using individual and multiple features with various fusion methods. “rDNN-F” indicates our rDNN focusing only on the exploitation of the feature relationships.

Method          Hollywood2    CCV      FCVID
Static CNN      40.1%         66.1%    63.8%
Motion          62.4%         60.8%    62.8%
Audio           22.7%         25.9%    26.1%
DNN             64.2%         71.6%    72.1%
NN-EF           63.5%         70.2%    74.7%
NN-LF           60.2%         69.9%    73.8%
SVM-EF          64.1%         71.7%    75.0%
SVM-LF          62.7%         69.1%    72.1%
SVM-MKL [59]    63.8%         71.3%    75.2%
M-DBM [27]      63.9%         71.1%    74.4%
rDNN-F          65.9%         72.9%    75.4%

Among the alternative approaches, early fusion methods tend to produce better results than late fusion. This is consistent with the observations of several recent works, where early fusion is more popularly adopted [3]. The MKL is even slightly worse than early fusion on Hollywood2 and CCV, indicating that the learned weights do not generalize well to testing data. In addition, for the contribution of the audio feature in the fusion experiments, we observe clear improvements for the classes with strong audio clues, such as "answering phone". On the contrary, for classes like "sitting down", audio features may slightly degrade the result.

4.2.2 Exploiting Class Relationships 

Next, we report results of rDNN using only the class relationships, namely rDNN-C. We compare with the DNN baseline with no regularization, DMF and DASD. Results are given in Table 2. rDNN-C outperforms the DNN baseline and the two alternative approaches. Both DMF and DASD use the outputs of the DNN baseline as inputs for context-based refinement. These results corroborate the effectiveness of the class relationship regularization.

Note that the DASD requires pre-computed class relationships as the input, which are estimated based on the label co-occurrences in the training data. This might be the reason that it performs worse than the rDNN-C, as the latter automatically learns the commonalities shared among the categories. The learning process can identify not only the categories that co-occur, but also those sharing visual or auditory commonalities but rarely appearing together. To verify this, we visualize some found category groups in Figure 3. As discussed in Section 3, values in the matrix $\Omega$ can indicate the learned relationships among the categories. Hence, we apply the spectral clustering algorithm on $\Omega$ to group the categories and visualize several groups having high within-group category similarities. We see that many categories are grouped together because they share certain commonalities, not due to high frequencies of co-occurrence.

TABLE 2
Performance comparison (mAP) with DMF and DASD, which focus on the use of the class relationships. “DNN” is a baseline without imposing any regularization and “rDNN-C” indicates our rDNN utilizing only the class relationships.

Method       Hollywood2    CCV      FCVID
DNN          64.2%         71.6%    72.1%
DMF [60]     61.8%         71.1%    72.5%
DASD [32]    64.4%         71.7%    72.8%
rDNN-C       65.1%         72.1%    74.4%
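The grouping of categories from a learned relationship matrix can be sketched with a small NumPy-only spectral clustering routine. This is one possible realization under explicit assumptions: the affinity is taken as the absolute value of the off-diagonal entries of $\Omega$, the normalized Laplacian embedding is used, and the embedded rows are clustered with a plain k-means using farthest-point initialization. None of these choices is specified in the text, and the function name is hypothetical.

```python
import numpy as np

def spectral_groups(Omega, k, iters=20):
    """Group categories into k clusters from a symmetric relationship matrix."""
    A = np.abs(np.asarray(Omega, dtype=float))
    np.fill_diagonal(A, 0.0)                       # no self-affinity
    deg = np.maximum(A.sum(axis=1), 1e-12)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L_sym = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)                # eigenvalues in ascending order
    U = vecs[:, :k]
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    # Farthest-point initialization of the k-means centers.
    centers = [U[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((U - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(dist)])
    centers = np.array(centers)
    labels = np.zeros(len(U), dtype=int)
    for _ in range(iters):
        d2 = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels
```

Categories that receive the same label form one group; groups with high within-group values in $\Omega$ are the natural candidates for visualization.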

4.2.3 Exploiting Both Kinds of Relationships 

Finally, we discuss the results of the entire rDNN framework, using both the feature and the class relationships. To better evaluate the effectiveness of rDNN, we plot the performance with different numbers of training samples in Figure 4. Overall, substantial performance gains are attained from the proposed approach. Using regularizations on both kinds of relationships leads to clearly higher performance than imposing the regularization on a single type of relationship. In addition, comparing the results across the three datasets using all the training samples, the improvement from exploiting the class relationships is more significant on FCVID. This is because FCVID contains a much larger number of classes that share commonalities helpful for categorization. Figure 5 further visualizes the confusion matrices of rDNN on the Hollywood2 and the CCV datasets.

We also observe that the performance gain of rDNN is more significant when the number of training samples is small (except the case of 10 training samples on FCVID, which are too few to distinguish the 239 categories). Under all the settings, the rDNN requires much less training data to achieve comparable results with the non-regularized version, which is a very appealing property in practical applications.

4.2.4 Comparison with the State of the Art

Fig. 3. Example frames of a few automatically found category groups (circled) in the FCVID dataset. Many of the found groups contain categories that share visual/auditory commonalities but do not necessarily co-occur. Discernible faces are blurred due to privacy concerns.

[Figure 4 panels: Hollywood2, CCV, FCVID; horizontal axis: # training samples]

Fig. 4. Performance on the three datasets using different numbers of training samples. We plot the results of the DNN baseline without regularization, rDNN-F, rDNN-C and the rDNN exploiting both types of relationships. The best mAPs on the three datasets (the rDNN approach using all the training samples) are 66.9%, 73.5% and 76.0%, respectively. See the text for more discussion.

Fig. 5. Confusion matrices of rDNN on the Hollywood2 and CCV datasets. Some CCV categories are very confusing, e.g., the three wedding related ones.

We compare rDNN with several state-of-the-art approaches in Table 3. On Hollywood2, our proposed method achieves a very competitive mAP of 66.9%, outperforming most of the compared approaches [61], [62], [2], [63], [64], [8], except a very recent result from Lan et al. [65]. Almost all these approaches are based on the popular dense trajectory features and the SVM classification with the simple early fusion method. Note that some of them like Wang et al. [2], Oneata et al. [62] and Lan et al. [65] encoded the
features using the Fisher vector |lSl/ which has been 
shown to be more effective than the classical bag- 
of-words representation used in our approach. The 
approach by Lan et al. [65] extends upon the dense trajectories with a feature enhancement method called multi-skip feature stacking. Since our rDNN focuses on feature fusion and classification, we expect that further performance improvements can be achieved by jointly using rDNN with these advanced features.

On the CCV dataset, we obtain the best performance to date with an mAP of 73.5%. Most recent works on CCV focused on the joint use of multiple audio-visual features. Xu et al. [68] and Ye et al. [20] extended late fusion with specially designed strategies to remove the noise of individually trained classifiers. Jhuo et al. adopted a joint audio-visual codebook to exploit feature relationships for categorization [24]. rDNN is different from these state-of-the-art approaches and produces significantly better results.

TABLE 3
Comparison with the state of the art. Our rDNN achieves very competitive results on both the Hollywood2 and the CCV datasets.

Hollywood2            mAP      CCV                 mAP
Jain et al. [61]      62.5%    Kim et al. [67]     56.5%
Oneata et al. [62]    63.5%    Xu et al. [68]      60.3%
Wang et al. [2]       64.3%    Ye et al. [20]      64.0%
Zhang et al. [63]     50.9%    Jhuo et al. [24]    64.0%
Ni et al. [64]        61.0%    Ma et al. [69]      63.4%
Wu et al. [8]         65.7%    Liu et al. [21]     68.2%
Lan et al. [65]       68.0%    Wu et al. [8]       70.6%
rDNN                  66.9%    rDNN                73.5%

TABLE 4
Training time per epoch of three neural network based approaches, measured on the Hollywood2 dataset.

Method    Training Time (Seconds)
NN-EF     1.540±0.02
NN-LF     1.552±0.05
rDNN      1.276±0.10

4.2.5 Computational Efficiency 

We discuss the computational efficiency of rDNN using the Hollywood2 dataset. The average training times per epoch for NN-EF, NN-LF and rDNN are given in Table 4, using the same GPU-based implementation. rDNN is more efficient than NN-EF and NN-LF as it contains fewer parameters to be learned. Specifically, compared with the NN-EF, rDNN processes the features separately in the first two layers and thus avoids the parameters needed for interactions among them. The NN-LF requires the training of separate networks, which is also more expensive. Note that the M-DBM method is not compared because it requires much more time to pre-train the network for weight initialization. For all the methods, normally a few hundred epochs are needed to finish the training (several minutes in total). After training, all the neural network based methods are extremely fast in testing.
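The parameter saving from processing features separately can be made concrete with a back-of-the-envelope count for the first layer. The sketch below is an illustration under stated assumptions: bias terms are ignored, every feature branch uses the same hidden width, and the concatenated network is given a first hidden layer of the same total width; the layer sizes used in the actual experiments are not specified here, and the function names are hypothetical.

```python
def separate_branch_params(input_dims, hidden):
    """First-layer weights when each feature has its own branch of width `hidden`."""
    return sum(d * hidden for d in input_dims)

def concatenated_params(input_dims, hidden):
    """First-layer weights when all features are concatenated and fully
    connected to a hidden layer of the same total width (M * hidden)."""
    return sum(input_dims) * hidden * len(input_dims)

# Example with a 4,096-d CNN feature and one 4,000-d bag-of-words feature,
# using a hypothetical hidden width of 100 units per feature.
dims = [4096, 4000]
print(separate_branch_params(dims, 100), concatenated_params(dims, 100))
```

Under these assumptions the concatenated layer needs M times as many first-layer weights as the separate branches, where M is the number of features.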

5 Conclusion 

We have proposed a novel rDNN approach to exploit both feature and class relationships in video categorization. By imposing trace-norm based regularizations on the specially tailored fusion layer and output layer, our rDNN can learn a fused representation of multiple feature inputs and utilize the commonalities shared among the semantic classes for improved categorization performance. Extensive experiments of action and event recognition on popular benchmarks have shown that rDNN consistently outperforms several alternative approaches as well as recent state-of-the-art methods. Our rDNN is also efficient in both model training and testing, which is very important for large scale applications. In addition, we have introduced a new benchmark dataset, namely FCVID, for large scale video categorization. FCVID is one of the largest datasets in the field with manual annotations, containing 91,223 videos carefully labeled according to 239 categories. We believe that FCVID is helpful for stimulating research not only on video categorization, but also on other related problems.

The current framework supports the use of any pre-computed features. One interesting direction for future work is to exploit the joint learning of feature representations and classification models. For instance, the adopted CNN feature is computed based on off-the-shelf models trained on ImageNet. It would probably be more effective if the feature extraction network could be further tuned simultaneously with the regularized classification network.

Appendix 

FCVID: Fudan-Columbia Video Dataset 

In this appendix, we introduce the collection and 
annotation process of the FCVID, and compare it with 
several existing datasets. 

.1 Collection and Annotation 

The categories in FCVID cover a wide range of topics like social events (e.g., "tailgate party"), procedural events (e.g., "making cake"), objects (e.g., "panda"), scenes (e.g., "beach"), etc. These categories were defined very carefully. Specifically, we conducted user surveys and used the organization structures on YouTube and Vimeo as references, and browsed numerous videos to identify categories that satisfy the following three criteria: (1) utility: high relevance in supporting practical application needs; (2) coverage: a good coverage of the contents that people record; and (3) feasibility: likely to be automatically recognized in the next several years, and a high frequency of occurrence that is sufficient for training a recognition algorithm.

This definition effort led to a set of over 250 candidate categories. For each category, in addition to the official name used in the public release, we manually defined another alternative name. Videos were then downloaded from YouTube searches using the official and the alternative names as search terms. The purpose of using the alternative names was to expand the candidate video sets. For each search, we downloaded 1,000 videos, and after removing duplicate videos and some extremely long ones (longer than 30 minutes), there were around 1,000-1,500 candidate videos for each category.

All the videos were annotated manually to ensure a high precision of the FCVID labels. In order to minimize subjectivity, nearly 20 annotators were involved in the task, and a master annotator was assigned to monitor the entire process and double-check all the found positive videos. Some of the videos are multi-labeled, and thus filtering the 1,000-1,500 videos for each category with a focus on just the single category label is not adequate. As checking the existence of all the 250+ classes for each video is extremely difficult, we used the following strategy to narrow down the "label search space" for each video. We first grouped the categories according to subjective predictions of label co-occurrences, e.g., "wedding reception" & "wedding ceremony", "waterfall" & "river", "hiking" & "mountain", and even "dog" & "birthday". We then annotated the videos not only based on the target category label, but also according to the identified related labels. This helped produce a fairly complete label set for FCVID, while largely reducing the annotation workload. After removing the rare categories with fewer than 100 videos after annotation, the final FCVID dataset contains 91,223 videos and 239 categories, where 183 are events and 56 are objects, scenes, etc.

Figure 6 shows the number of videos per category. "Dog" has the largest number of positive videos (1,136), while "making egg tarts" is the most infrequent category containing only 108 samples. The total duration of FCVID is 4,232 hours with an average video duration of 167 seconds. Figure 7 further gives the average video duration of each category.
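As a quick consistency check on these statistics, the reported average duration follows from the total duration and the number of videos (a worked calculation using only the figures stated in this appendix):

$$
\frac{4{,}232\ \text{h} \times 3{,}600\ \text{s/h}}{91{,}223\ \text{videos}} = \frac{15{,}235{,}200\ \text{s}}{91{,}223\ \text{videos}} \approx 167.0\ \text{s per video}.
$$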

The categories are organized using a hierarchy containing 11 high-level groups, which is visualized in Figure 8.

.2 Comparison with Related Datasets

We compare FCVID with the following datasets. Most 
of them have been widely adopted in the existing 
works on video categorization. 

KTH and Weizmann: The KTH [70] and the Weizmann [71] datasets are well-known benchmarks for human action recognition. The former contains 600 videos of 6 human actions performed by 25 people in four scenarios, and the latter consists of 81 videos associated with 9 actions performed by 9 actors.

Hollywood Human Action: The Hollywood 
dataset m contains 8 action classes collected from 
32 Hollywood movies with a total of 430 videos. 
It was further extended to the Hollywood2 [72] dataset, which covers 12 actions from 69 Hollywood movies totaling 1,707 videos. Compared with KTH and Weizmann, where videos were mostly captured under controlled environments with static cameras, the Hollywood datasets are more challenging due to cluttered backgrounds and severe camera motion.

TABLE 5
Several popular benchmark datasets for video categorization, sorted by the year of construction.

Dataset                        # Videos     # Classes   Year of Construction   Background   Manually Labeled?
KTH                            600          6           2004                   Static       Yes
Weizmann                       81           9           2005                   Static       Yes
Kodak                          1,358        25          2007                   Dynamic      Yes
Hollywood                      430          8           2008                   Dynamic      Yes
Hollywood2                     1,707        12          2009                   Dynamic      Yes
MCG-WEBV                       234,414      15          2009                   Dynamic      Yes
Olympic Sports                 800          16          2010                   Dynamic      Yes
HMDB51                         6,766        51          2011                   Dynamic      Yes
CCV                            9,317        20          2011                   Dynamic      Yes
UCF-101                        13,320       101         2012                   Dynamic      Yes
THUMOS-2014                    18,394       101         2014                   Dynamic      Yes
MED-2014 (development set)     ≈31,000      20          2014                   Dynamic      Yes
Sports-1M                      1,133,158    487         2014                   Dynamic      No
FCVID                          91,223       239         2015                   Dynamic      Yes

Olympic Sports: This dataset was introduced in 2010 [73], containing 800 clips and 16 action classes. The videos were downloaded from the Internet.

HMDB51: The HMDB51 [74] dataset was collected from a variety of sources, such as movies and consumer videos on YouTube. It contains 6,766 videos annotated into 51 classes.

UCF-101 & THUMOS-2014: The UCF-101 [7] dataset is another popular benchmark for human action recognition in videos, which consists of 13,320 video clips (27 hours in total). There are 101 annotated classes that can be divided into five types: Human-Object Interaction, Body-Motion Only, Human-Human Interaction, Playing Musical Instruments and Sports. More recently, the THUMOS-2014 Action Recognition Challenge [75] created a benchmark by extending upon the UCF-101 dataset (used as the training set). Additional videos were collected from the Internet, including 2,500 background videos, 1,000 validation videos and 1,574 test videos.

Kodak Consumer Videos: The Kodak consumer videos were recorded by around 100 customers of the Eastman Kodak Company [76]. The dataset consists of 1,358 video clips labeled with 25 concepts (including activities, scenes and single objects) as a part of the Kodak concept ontology [76].

MCG-WEBV: MCG-WEBV is a large set of YouTube videos collected by the Chinese Academy of Sciences [77]. There are 234,414 web videos with annotations on several topic-level events like "a conflict at Gaza", which are too complicated to be recognized relying merely on content analysis.

Columbia Consumer Videos (CCV): The CCV dataset was constructed in 2011, aiming to stimulate the research on Internet consumer video analysis [54]. It contains 9,317 user generated videos from YouTube, which are annotated into 20 classes, including objects (e.g., "cat" and "dog"), scenes (e.g., "beach" and "playground"), sports events (e.g., "basketball" and "soccer") and social activities (e.g., "birthday" and "graduation").


TRECVID MED: Driven by the practical needs of analyzing high-level events in videos, the annual NIST TRECVID activity has run a Multimedia Event Detection (MED) task since 2010 [78]. Each year a new or an extended dataset is constructed for worldwide system comparisons. In 2014, the MED dataset contained 20 events, such as "birthday party" and "bike trick". According to NIST, in the development set, there are around 8,000 videos for training and 23,000 videos as dry-run validation samples (1,200 hours in total). The MED dataset is only available to the participants of the task, and the labels of the official test set (200,000 videos) are not available even to the participants.

Sports-1M: The Sports-1M [5] dataset was released in 2014, consisting of 1 million YouTube videos and 487 classes, such as "bowling", "cycling", "rafting", etc. The dataset is not manually labeled. The annotations were automatically produced by analyzing textual contexts of the videos.

We further summarize and compare FCVID with these datasets in Table 5. FCVID is one of the largest datasets in terms of the numbers of videos and categories. Sports-1M is larger but focuses only on sports and is not manually labeled.

.3 Released Resources 

The dataset can be downloaded after submitting an agreement form. Released resources include the videos, labels, a standard train/test split, a category hierarchy and several pre-computed descriptors (CNN [56], SIFT [79], Improved Dense Trajectories [2], and two audio features). We have also released the textual meta-data (e.g., tags) of the videos to support related research on Internet video analysis. See the FCVID website for more details.

Fig. 6. The number of videos per category in FCVID.